Exercise 3 – Classification I
Exercise 1: Logistic vs softmax regression
Binary logistic regression is a special case of multiclass logistic, or softmax, regression. The softmax function is the multiclass analogue of the logistic function, transforming the scores \(\boldsymbol{\theta}_k^\top \mathbf{x}\) to values in the range [0, 1] that sum to one. The softmax function is defined as:
\[ \pi_k(\mathbf{x}| \boldsymbol{\theta}) = \frac{\exp(\boldsymbol{\theta}_k^\top \mathbf{x})}{\sum_{j=1}^{g} \exp(\boldsymbol{\theta}_j^\top \mathbf{x})}, \quad k \in \{1, \dots, g\} \]
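As a quick numerical illustration, the definition above can be sketched in a few lines of Python (a minimal sketch; the parameter matrix and feature vector below are made-up toy values):

```python
import numpy as np

def softmax_probs(theta, x):
    """Class probabilities pi_k(x | theta) for a g x p parameter matrix theta."""
    scores = theta @ x            # theta_k^T x for each class k
    scores = scores - scores.max()  # numerical stabilization; softmax is shift-invariant
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# toy example with g = 3 classes and p = 2 features (made-up numbers)
theta = np.array([[ 1.0, -0.5],
                  [ 0.2,  0.8],
                  [-1.0,  0.3]])
x = np.array([0.5, 2.0])

probs = softmax_probs(theta, x)
print(probs, probs.sum())  # g probabilities in [0, 1] summing to 1
```

Subtracting the maximum score before exponentiating leaves the result unchanged but avoids overflow for large scores.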
- Show that logistic and softmax regression are equivalent for \(g = 2\).
Solution
As we would expect, the two formulations are equivalent (up to reparameterization). In order to see this, consider the softmax function components for both classes:
\[ \pi_1(\mathbf{x}| \boldsymbol{\theta}) = \frac{\exp(\boldsymbol{\theta}_1^\top \mathbf{x})}{\exp(\boldsymbol{\theta}_1^\top \mathbf{x}) + \exp(\boldsymbol{\theta}_2^\top \mathbf{x})} \]
\[ \pi_2(\mathbf{x}| \boldsymbol{\theta}) = \frac{\exp(\boldsymbol{\theta}_2^\top \mathbf{x})}{\exp(\boldsymbol{\theta}_1^\top \mathbf{x}) + \exp(\boldsymbol{\theta}_2^\top \mathbf{x})} \]
Since we know that \(\pi_1(\mathbf{x}| \boldsymbol{\theta}) + \pi_2(\mathbf{x}| \boldsymbol{\theta}) = 1\), it is sufficient to compute one of the two scoring functions. Let’s pick \(\pi_1(\mathbf{x}| \boldsymbol{\theta})\) and relate it to the logistic function:
\[ \pi_1(\mathbf{x}| \boldsymbol{\theta}) = \frac{1}{1 + \exp(\boldsymbol{\theta}_2^\top \mathbf{x}- \boldsymbol{\theta}_1^\top \mathbf{x})} = \frac{1}{1 + \exp(-\boldsymbol{\theta}^\top \mathbf{x})} \]
where \(\boldsymbol{\theta}:= \boldsymbol{\theta}_1 - \boldsymbol{\theta}_2\). Thus, we obtain the binary-case logistic function, reflecting that we only need one scoring function (and thus one parameter vector \(\boldsymbol{\theta}\) rather than two, \(\boldsymbol{\theta}_1\) and \(\boldsymbol{\theta}_2\)).

Exercise 2: Hyperplanes
Linear classifiers like logistic regression learn a decision boundary that takes the form of a (linear) hyperplane. Hyperplanes are defined by equations \(\boldsymbol{\theta}^\top \mathbf{x}= b\) with coefficients \(\boldsymbol{\theta}\) and a scalar \(b \in \mathbb{R}\).
- In order to see that such expressions actually describe hyperplanes, consider \(\boldsymbol{\theta}^\top \mathbf{x}= \theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0\). Sketch the hyperplanes given by the following coefficients and explain the difference between the parameterizations:
- \(\theta_0 = 0, \theta_1 = \theta_2 = 1\)
- \(\theta_0 = 1, \theta_1 = \theta_2 = 1\)
- \(\theta_0 = 0, \theta_1 = 1, \theta_2 = 2\)
Solution
A hyperplane in 2D is just a line. We know that two points are sufficient to describe a line, so all we need to do is pick two points fulfilling the hyperplane equation.
- \(\theta_0 = 0, \theta_1 = \theta_2 = 1\) \(\rightsquigarrow\) e.g., (0, 0) and (1, -1). Sketch it:
- \(\theta_0 = 1, \theta_1 = \theta_2 = 1\) \(\rightsquigarrow\) e.g., (0, -1) and (1, -2). The nonzero \(\theta_0\) shifts the line in parallel without changing its slope:
- \(\theta_0 = 0, \theta_1 = 1, \theta_2 = 2\) \(\rightsquigarrow\) e.g., (0, 0) and (1, -0.5). The change in \(\theta_2\) pivots the line around the intercept:
We see that a hyperplane is defined by the points that lie directly on it and thus fulfill the hyperplane equation.
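The example points picked above can be verified programmatically (a small sketch; the helper `on_plane` simply plugs a candidate point into the hyperplane equation):

```python
def on_plane(theta0, theta1, theta2, x1, x2, tol=1e-9):
    """Check whether (x1, x2) satisfies theta0 + theta1*x1 + theta2*x2 = 0."""
    return abs(theta0 + theta1 * x1 + theta2 * x2) < tol

# the three parameterizations and the example points from above
cases = [
    ((0, 1, 1), [(0, 0), (1, -1)]),
    ((1, 1, 1), [(0, -1), (1, -2)]),
    ((0, 1, 2), [(0, 0), (1, -0.5)]),
]

for (t0, t1, t2), points in cases:
    for (x1, x2) in points:
        assert on_plane(t0, t1, t2, x1, x2)
```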
Exercise 3: Decision Boundaries & Thresholds in Logistic Regression
In logistic regression (binary case), we estimate the probability \(p(y = 1 | \mathbf{x}, \boldsymbol{\theta}) = \pi(\mathbf{x}| \boldsymbol{\theta})\). In order to decide about the class of an observation, we set \(\hat{y} = 1\) iff \(\pi(\mathbf{x}| \boldsymbol{\theta}) \geq \alpha\) for some \(\alpha \in (0, 1)\).
- Show that the decision boundary of the logistic classifier is a (linear) hyperplane.
Hint
Derive the value of \(\boldsymbol{\theta}^\top \mathbf{x}\) (depending on \(\alpha\)) starting from which you predict \(\hat{y} = 1\) rather than \(\hat{y} = 0\).
Solution
We evaluate
\[\begin{align*} \pi(\mathbf{x})&= \frac{1}{1 + \exp(-\boldsymbol{\theta}^\top \mathbf{x})} = \alpha \\ &\Leftrightarrow\ 1 + \exp(-\boldsymbol{\theta}^\top \mathbf{x}) = \frac{1}{\alpha} \\ &\Leftrightarrow\ \exp(-\boldsymbol{\theta}^\top \mathbf{x}) = \frac{1}{\alpha} - 1 \\ &\Leftrightarrow\ -\boldsymbol{\theta}^\top \mathbf{x}= \log \left(\frac{1}{\alpha} - 1 \right) \\ &\Leftrightarrow\ \boldsymbol{\theta}^\top \mathbf{x}= -\log \left(\frac{1}{\alpha} - 1 \right). \end{align*}\]
\(\boldsymbol{\theta}^\top \mathbf{x}= -\log \left(\frac{1}{\alpha} - 1 \right)\) is the equation of the linear hyperplane comprised of all points \(\mathbf{x}\) whose score \(\boldsymbol{\theta}^\top \mathbf{x}\) is equal to \(-\log \left(\frac{1}{\alpha} - 1\right)\). The equation therefore describes the decision rule for setting \(\hat y\) equal to 1: we predict class 1 for all points that lie on this hyperplane or on its positive side.
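The algebra above can be sanity-checked numerically (a minimal sketch; the threshold value is a made-up example):

```python
import math

def sigmoid(t):
    """Logistic function applied to the score t = theta^T x."""
    return 1.0 / (1.0 + math.exp(-t))

alpha = 0.7
# score value on the decision boundary derived above
t_boundary = -math.log(1.0 / alpha - 1.0)

# on the boundary, the predicted probability equals the threshold alpha
assert abs(sigmoid(t_boundary) - alpha) < 1e-9
# moving slightly to either side of the boundary flips the decision
assert sigmoid(t_boundary + 1e-6) >= alpha
assert sigmoid(t_boundary - 1e-6) < alpha
```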
- Below you see the logistic function for a binary classification problem with two input features for different values \(\boldsymbol{\theta}= (\theta_1, \theta_2)^\top\) (plots 1-3) as well as \(\alpha\) (plot 4). What can you deduce for the values of \(\theta_1\), \(\theta_2\), and \(\alpha\)? What are the implications for classification in the different scenarios?
Solution
We observe
in plot (1): the logistic function runs parallel to the \(x_2\) axis, so it is the same for every value of \(x_2\). In other words, \(x_2\) does not contribute anything to the class discrimination and its associated parameter \(\theta_2\) is equal to 0.
in plot (2): both dimensions affect the logistic function – to equal degree in this case, meaning \(x_1\) and \(x_2\) are equally important. If \(\theta_1\) were larger than \(\theta_2\), or vice versa, the hypersurface would be more tilted towards the respective axis. Furthermore, since \(\theta_1\) and \(\theta_2\) are positive, \(\pi(\mathbf{x})\) increases with higher values of \(x_1\) and \(x_2\).
in plot (3): this is the same situation as in plot (2) but the logistic function is steeper, which is due to \(\theta_1, \theta_2\) having larger absolute values. We therefore get a sharper separation between classes (fewer predicted probability values close to 0.5, so we are overall more confident in our decision). As in plot (2), the increasing probability of \(\hat{y}=1\) for higher values of \(x_1\) and \(x_2\) indicates positive values for \(\theta_1\) and \(\theta_2\).
in plot (4): this is the same situation as in plot (1). The different values for \(\alpha\) represent different thresholds: a high value (leftmost line) means we only assign class 1 if the estimated class-1 probability is large. Conversely, a low value (rightmost line) signifies we are ready to predict class 1 at a low threshold – in effect, this is the same as the previous scenario, only the class labels are flipped. The mid line corresponds to the common case \(\alpha = 0.5\) where we assign class 1 as soon as the predicted probability is more than 50%.
- Derive the equation for the decision boundary hyperplane if we choose \(\alpha = 0.5\).
Solution
We make use of our results from a):
\[\begin{align*} \hat{y} = 1 &\Leftrightarrow \boldsymbol{\theta}^\top \mathbf{x}\geq -\log \left(\frac{1}{\alpha} - 1 \right) \\ &\Leftrightarrow \boldsymbol{\theta}^\top \mathbf{x}\geq -\log \left(\frac{1}{0.5} - 1 \right) \\ &\Leftrightarrow \boldsymbol{\theta}^\top \mathbf{x}\geq -\log 1 \\ &\Leftrightarrow \boldsymbol{\theta}^\top \mathbf{x}\geq 0. \end{align*}\]
The 0.5 threshold therefore yields the hyperplane \(\boldsymbol{\theta}^\top \mathbf{x}= 0\) and divides the input space into the positive “1” halfspace where \(\boldsymbol{\theta}^\top \mathbf{x}\geq 0\) and the “0” halfspace where \(\boldsymbol{\theta}^\top \mathbf{x}< 0\).
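For \(\alpha = 0.5\), thresholding the probability thus reduces to a pure sign check on the score (a minimal sketch):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def predict(score, alpha=0.5):
    """Label via probability threshold: y_hat = 1 iff sigmoid(score) >= alpha."""
    return 1 if sigmoid(score) >= alpha else 0

# with alpha = 0.5 the probability threshold coincides with score >= 0
for score in [-2.0, -0.1, 0.0, 0.1, 2.0]:
    assert predict(score) == (1 if score >= 0 else 0)
```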
- Explain when it might be sensible to set \(\alpha\) to 0.5.
Solution
When the threshold \(\alpha = 0.5\) is chosen, the losses of misclassified observations, i.e., \(L(\hat{y} = 0 ~|~ y = 1)\) and \(L(\hat{y} = 1 ~|~ y = 0)\), are treated equally, which is often the intuitive thing to do. It means \(\alpha = 0.5\) is a sensible threshold if we do not wish to avoid one type of misclassification more than the other. If, however, we need to be cautious to only predict class 1 if we are very confident (for example, when the decision triggers a costly therapy), it would make sense to set the threshold considerably higher.
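This cost intuition can be made explicit: comparing the expected losses of the two possible predictions yields the cost-minimizing threshold \(\alpha = \frac{c_{FP}}{c_{FP} + c_{FN}}\), where \(c_{FP}\) and \(c_{FN}\) denote the misclassification costs. A minimal sketch (the cost values are made-up examples):

```python
def cost_threshold(cost_fp, cost_fn):
    """Expected-cost-minimizing threshold: predict 1 iff pi(x) >= alpha.

    Derivation: predicting 1 costs (1 - pi) * cost_fp in expectation,
    predicting 0 costs pi * cost_fn; predicting 1 is cheaper iff
    pi >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)

# equal costs recover the default threshold of 0.5
assert cost_threshold(1.0, 1.0) == 0.5

# if falsely predicting class 1 is costly (e.g., triggering an expensive
# therapy), the threshold moves up: only confident predictions yield y_hat = 1
assert cost_threshold(9.0, 1.0) == 0.9
```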